Getting Structured Data from the Internet: Running Web Crawlers/Scrapers on a Big Data Production Scale by Jay M. Patel
Author: Jay M. Patel
Language: eng
Format: epub
ISBN: 9781484265765
Publisher: Apress
from sklearn.metrics import silhouette_samples
y_km = km.fit_predict(X_train_text)
pd.Series(y_km).value_counts().to_dict()
# Output
{1: 395, 7: 292, 4: 289, 6: 242, 2: 158, 3: 149, 0: 145, 5: 110}
Listing 4-42. K-means clustering
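Listing 4-42 relies on `km` and `X_train_text`, which are defined earlier in the chapter and are not shown in this excerpt. The following is a minimal, self-contained sketch of the same step, not the book's actual setup: the toy corpus, the TF-IDF vectorizer, and the values of `n_clusters`, `n_init`, and `random_state` are all assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; the book's corpus is much larger.
docs = [
    "web crawler downloads html pages",
    "scraper parses html pages",
    "crawler follows links between pages",
    "kmeans groups similar documents",
    "clustering assigns documents to groups",
    "silhouette score evaluates clustering",
]

# Stand-in for the chapter's X_train_text: a sparse TF-IDF matrix.
X_train_text = TfidfVectorizer().fit_transform(docs)

# n_clusters, n_init, and random_state are assumed values, not the book's.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
y_km = km.fit_predict(X_train_text)

# Count how many documents were assigned to each cluster label.
cluster_sizes = pd.Series(y_km).value_counts().to_dict()
print(cluster_sizes)
```

The resulting dictionary maps each cluster label to its document count, which is the same shape as the output shown above.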
Since the preceding clusters all have a reasonably balanced number of members, we check the top terms per cluster in Listing 4-43. We add the cluster numbers as a column to the document-term matrix dataframe and filter the dataframe to show documents from one cluster at a time. Once we have a filtered dataframe, it's just a matter of adding up the token weights, transposing the result, and sorting it in descending order to display the top 30 terms from each cluster.
df_dtm["cluster_name"] = y_km
df_dtm.head()
cluster_list = len(df_dtm['cluster_name'].unique())
for cluster_number in range(cluster_list):
    print("*"*20)
    print("Cluster %d: " % cluster_number)
    df_cl = df_dtm[df_dtm['cluster_name'] == cluster_number]
    df_cl = df_cl.drop(columns='cluster_name')
    print("Total documents in cluster: ", len(df_cl))
    print()
    df_sum = df_cl.agg(['sum'])
    df_sum = df_sum.transpose()
    df_sum_transpose_sort_descending = df_sum.sort_values(by='sum', ascending=False)
    df_sum_transpose_sort_descending.index.name = 'words'
    df_sum_transpose_sort_descending.reset_index(inplace=True)
    print(','.join(df_sum_transpose_sort_descending.words.iloc[:30].tolist()))
# Output
********************
Cluster 0:
Total documents in cluster: 145
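The filter, sum, transpose, and sort steps described above can also be expressed with a single pandas `groupby`. This is an alternative sketch rather than the book's listing: the small document-term matrix, its labels, and the helper name `top_terms_per_cluster` are hypothetical and exist only to make the example runnable.

```python
import pandas as pd

# Hypothetical document-term matrix: rows are documents, columns are token weights.
df_dtm = pd.DataFrame(
    {
        "crawler": [0.9, 0.7, 0.0, 0.1],
        "html": [0.5, 0.6, 0.0, 0.0],
        "cluster": [0.0, 0.1, 0.8, 0.9],
        "kmeans": [0.0, 0.0, 0.7, 0.4],
    }
)
df_dtm["cluster_name"] = [0, 0, 1, 1]  # assumed cluster labels


def top_terms_per_cluster(dtm, label_col="cluster_name", n_terms=30):
    """Return {cluster label: terms ranked by summed weight, highest first}."""
    # One row per cluster, one column per token, holding the summed weights.
    sums = dtm.groupby(label_col).sum()
    return {
        label: row.sort_values(ascending=False).head(n_terms).index.tolist()
        for label, row in sums.iterrows()
    }


top_terms = top_terms_per_cluster(df_dtm, n_terms=2)
for label, terms in top_terms.items():
    print("Cluster %d: %s" % (label, ",".join(terms)))
```

Grouping once avoids re-filtering the full dataframe for every cluster, which may matter when the document-term matrix is large.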
